Invited talk: Phonological Models in Automatic Speech Recognition

Author

  • Karen Livescu

Abstract

The performance of automatic speech recognition systems varies widely across different contexts. Very good performance can be achieved on single-speaker, large-vocabulary dictation in a clean acoustic environment, as well as on very small vocabulary tasks with far fewer constraints on the speakers and acoustic conditions. In other domains, speech recognition is still far from usable for real-world applications. One domain that is still elusive is that of spontaneous conversational speech. This type of speech poses a number of challenges, such as the presence of disfluencies, a mix of speech and non-speech sounds such as laughter, and extreme variation in pronunciation. In this talk, I will focus on the challenge of pronunciation variation. A number of analyses suggest that this variability is responsible for a large part of the drop in recognition performance between read (dictated) speech and conversational speech. I will describe efforts in the speech recognition community to characterize and model pronunciation variation, both for conversational speech and in general. The work can be roughly divided into several types of approaches, including: augmentation of a phonetic pronunciation lexicon with phonological rules; the use of large (syllable- or word-sized) units instead of the more traditional phonetic ones; and the use of smaller units, such as distinctive or articulatory features. Of these, the first is the most thoroughly studied and also the most disappointing: Despite successes in a few domains, it has been difficult to obtain significant recognition improvements by including in the lexicon those phonetic pronunciations that appear to exist in the data. In part as a reaction to this, many have advocated the use of a “null pronunciation model,” i.e., a very limited lexicon including only canonical pronunciations. The assumption in this approach is that the observation model—the distribution of the acoustics given phonetic units—will better model the “noise” introduced by pronunciation variability. I will advocate an alternative view: that the phone unit may not be the most appropriate for modeling the lexicon. When considering a variety of pronunciation phenomena, it becomes apparent that phonetic transcription often obscures some of the fundamental processes that are at play. I will describe approaches using both larger and “smaller” units. Larger units are typically syllables or words, and allow greater freedom to model the component states of each unit. In the class of “smaller” unit models, ideas from articulatory and autosegmental phonology motivate multi-tier models in which different features (or groups of features) have semi-independent behavior. I will present a particular model in which articulatory features are represented as variables in a dynamic Bayesian network. Non-phonetic pronunciation models can involve significantly different model structures than those typically used in speech recognition, and as a result they may also entail modifications to other components such as the observation model and training algorithms. At this point it is not clear what the “winning” approach will be. The success of a given approach may depend on the domain or on the amount and type of training data available. I will describe some of the current challenges and ongoing work, with a particular focus on the role of phonological theories in statistical models of pronunciation (and vice versa?).
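To make the multi-tier idea concrete, the sketch below (plain Python; the feature names, value sets, and probabilities are illustrative assumptions, not the talk's actual model) treats a small set of articulatory features as semi-independent state variables, each following its own simple transition model. This is the kind of factored state that a dynamic Bayesian network would encode; a full recognizer would further couple the tiers to the word sequence and to an acoustic observation model conditioned on the joint articulatory configuration.

```python
import random

# Minimal sketch of a factored (multi-tier) pronunciation state, in the
# spirit of articulatory-feature models.  All names and numbers here are
# illustrative assumptions, not values from the talk.

FEATURES = {
    "lips":       ["closed", "narrow", "wide"],
    "tongue_tip": ["closed", "critical", "open"],
    "glottis":    ["voiced", "voiceless"],
    "velum":      ["closed", "open"],
}

STAY_PROB = 0.8  # assumed tendency of each tier to persist across frames


def step(state):
    """Advance each articulatory tier semi-independently by one frame."""
    new_state = {}
    for feat, values in FEATURES.items():
        if random.random() < STAY_PROB:
            new_state[feat] = state[feat]            # tier keeps its value
        else:
            new_state[feat] = random.choice(values)  # tier changes on its own
    return new_state


def sample_trajectory(n_frames=10, seed=0):
    """Sample a sequence of joint articulatory configurations."""
    random.seed(seed)
    state = {feat: values[0] for feat, values in FEATURES.items()}
    trajectory = [state]
    for _ in range(n_frames - 1):
        state = step(state)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    for t, state in enumerate(sample_trajectory()):
        print(t, state)
```

Because each tier evolves separately, the model can represent pronunciations in which, say, the velum opens early or the tongue lags behind the lips, variation that a single phone-string transcription would have to flatten into a substituted or deleted phone.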


Similar resources

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...


A Study on the Use of Conditional Random Fields for Automatic Speech Recognition

Current state-of-the-art systems for Automatic Speech Recognition (ASR) use statistical modeling techniques such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to recognize spoken language. These techniques make use of statistics derived from the acoustic frequencies of the speech signal. In recent years, interest has been rising in the use of phonological features derived fr...


Borrowing Language Resources for Development of Automatic Speech Recognition for Low- and Middle-Density Languages

In this paper we describe an approach that both creates crosslingual acoustic monophone model sets for speech recognition tasks and objectively predicts their performance without target-language speech data or acoustic measurement techniques. This strategy is based on a series of linguistic metrics characterizing the articulatory phonetic and phonological distances of target-language phonemes f...


A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation

Recent developments in robotics and automation have motivated researchers to improve the efficiency of interactive systems by enabling natural man-machine interaction. Since speech is the most popular method of communication, recognizing human emotions from speech signal becomes a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...


From Speech Sounds to Symbolic Representation – About New Approaches to Impose Structure on Speech Data

During the last decades, phonetic and phonological knowledge about the structure of speech has provided a basis for the development of computational models for speech recognition, including automatic speech recognition (ASR). Since speech sounds are highly variable, the robustness of the mapping from a speech signal to its discrete symbolic representation is one of the most difficult problems i...



Publication date: 2008